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Abstract: We present a class of sparse generalized linear models that include 
probit and logistic regression as special cases and offer some extra flexibility. 
We provide an EM algorithm for learning the parameters of these models from 
data. We apply our method in text classification and in simulated data and 
show that our method outperforms the logistic and probit models and also the 
elastic net, in general by a substantial margin. 

1. Introduction 

The standard approach to model the dependence of binary data on explanatory vari- 
ables under the Generalized linear models setting, is through a cumulative density 
function (cdf) 'J. For example, for a vector of explanatory variables x and a random 
variable y that takes values in {0, 1}, the conditional probability of y given x and 
a vector of parameters (3 is modelled using a cdf ty, i.e. Pr(y = l\x,(3) = ^>(x T /3). 
The most commonly used cdfs are the logistic and normal cdfs. The corresponding 
"link functions", are the logit and probit link functions respectively. 

Albert and Chib [l| proposed a Student-t inverse cdf as the link funtion. This 
model includes the logit and probit as special cases, at least approximately. For 
probabilities between 0.001 and 0.999, logistic quantiles are approximately a linear 
function of the quantiles of a Student-t distribition with 8 degrees of freedom. Also, 
when the degrees of freedom in a Student-t distribution are large, (say v > 20) the t- 
link approximates the probit model. The degrees of freedom v control the thickness 
of the tail of the t-density allowing for a more flexible model. One can also benefit 
from this model as it can be presented via a latent variable representation that 
allows one to estimate the parameters of the model easily using the EM algorithm 
(Dempster et al. [H). 

For the logistic, normal and Student-t, the corresponding probability density 
functions (pdf) are symmetric around zero. This implies that their corresponding 
cdfs approach 1 at the same rate that they approach 0, which may not always be 
reasonable. In some applications, the overall fit can significantly improve by using 
a cdf that approaches and 1 at different rates. 

Many authors have proposed models with asymmetric link functions that can 
approximate, or have as special cases, the logit and probit links. Stukel [151 ] proposes 
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a class of models that generalizes the logistic model. Chen et al. [J] introduce an 
alternative to the probit models where the cdf is the skew-normal distribution from 
Azzalini and Delia Valle Q. Fernandez and Steel @ propose a class of skewed 
densities indexed by a scalar 6 £ (0, oo) of the form: 

(i) p(y\s) = ^-tU(§ )i[o,°o) (y) + f(y5)i ( -oo,o) (2/)}, y e 5? 

where /(.) is a univariate probability density function (pdf) with the mode at 
and symmetry around the mode. The parameter S determines the amount of mass 
at each side of the mode, and hence the skewness of the densities. Capobianco 0]) 
considers the Student-t pdf as the univariate /(.) density in equation (1). This is 
an appealing model as it contains a parameter that controls the thickness of the 
tails and another parameter that determines the skewness of the density. 

We apply an extended version of this model to textual datasets, where the prob- 
lem is to classify text documents into predefined categories. We consider a Bayesian 
hierarchical model that contains, in addition to the parameter that controls the 
skewness of the density and the parameter that controls the thickness of the tails, 
a third parameter that controls the sparseness of the model, i.e. the number of 
regression parameters with zero posterior mode. In studies when there are a large 
number of predictor variables, this methodology gives one way of discriminating 
and selecting relevant predictors. 

In what follows we describe the model that we propose. 



2. The model 



Suppose that n independent binary random variables Y% , . . . , Y n are observed to- 
gether with a vector of predictors xi, . . . , x„, where Yi = 1 with probability PiYi = 
1 1/3, x^. Under the generalized linear model setting, models for binary classification 
satisfy P(Y* = l|/3, Xj) = \&(x^/3) where ^ is a nonnegative function whose range is 
between and 1. For instance, the probit model is obtained when "J is the normal 
cumulative distribution and the logit model when 'J is the logistic cdf. 

Under Bayesian learning one starts with a prior probability distribution for the 
unknown parameters of the model. Prediction of new data utilizes the posterior 
probability distribution. More specifically, denote by 7r(/3) the prior density function 
for the unknown parameter vector f3. Then the posterior density of f3 is given by 

i"(/3|{(xi,j/i),...,(x n ,y n )}) = 



and the posterior predictive distribution for y given x is 

7r(?/|x,{(xi,yi), . . . , (x„,y„)}) = / Pr(y\x, /3)tt(/3|{(xi, y x ), . . . , (x„, y„)})df3 

which in general are intractable due to the many integrals. 

In the model that is proposed in this paper, we estimate the vector of parameters 
(3 as the mode of the posterior density 7r(/3|{(xi, yx), . . . , (x„, y„)}) and prediction 
of new data is performed using the following rule: 

(2) y = 1 if 7r(y = l|x, {(x 1; y x ), (x„, y n )}) > 0.5, 

y = if Tr(y = l|x, {(xi, j/i), . . . , (x„,y„)}) < 0.5 
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which is an optimal rule in the sense that it has the smallest expected prediction 
error (see Hastie et al. [10] for more details). 

We consider two general cases for the form of First, we consider a cdf $ that 
approaches 1 and at the same rate. This is referred to hereon as the symmetric 
case. Second, we generalize this model assuming that the inverse of the link function 
is the cdf of a skewed density. In this way, the inverse of the link function approaches 
and 1 at different rates. 

2.1. Symmetric case 

The model that we propose assumes that the conditional distribution of Yi given 
a vector of predictors x and a vector of unknown parameters f3 is determined 
by the cdf of the Student-t distribution with v degrees of freedom evaluated at 
xj (3 i.e. P(Yi = l|/3,x,;) = T v (xf(3). Figure Q] shows how the Student-t cdf can 
approximate the normal and logistic cdfs. The black continuous line corresponds to 
the normal cdf, the dashed line correspond to the logistic cdf and the dotted lines 
to the Student-t cdf with different degrees of freedom. Also Figure [2] shows the 
logistic quantiles against the quantiles of a Student-t distribution with 8 degrees of 
freedom for probabilities between 0.0005 and 0.9995. The straight line corresponds 
to the linear model fit. 

Assume that apriori the distribution of (3j is normal with mean and variance 
Tj, N(0,Tj) and the distribution of the hyperparameters Tj is exponential exp(2/j) 
with pdf equal to ^e _7Tj / 2 . Integrating with respect to Tj, one obtains Pr(f3j\^f) = 

^exp(—y/j\\Pj\\), which is the double exponential prior. In Section[3]it will became 
clear that decomposing the double exponential into a two-level Bayesian hierarchical 
model, allows us to estimate the parameters (3 via the EM algorithm, where in 
addition to the latent variables Z (introduced below) the Tj parameters are seen as 
missing data. 




-1 1 1 i r 

-10 -5 5 10 
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Fig 1. The black continuous line corresponds to the normal cdf, the dashed line correspond to the 
logistic cdf and the dotted lines to the Student-t cdf with different degrees of freedom. 
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Fig 2. Plot of the logistic quantiles against the quantiles of a Student-t distribution with 8 degrees 
of freedom for probabilities between 0.00005 and 0.99995. The strait line corresponds to the linear 
model fit. 



The normal prior on the (3 parameters has the effect of shrinking the maximun 
likelihood estimator of (3 towards zero, which has been shown to give a better 
generalization performance (see Hastie et al. for more details). A different 
variance in the priors of the f3 gives the flexibility of having the parameters shrunk 
by a different amount which is relevant when the predictors influence the response 
unevenly. Moreover, the particular distribution of the variances Tj that we use, 
will shrink some of the parameters [3j to be exactly equal to zero. If the hyperprior 
distribution for Tj places significant weight on small values of Tj then it is likely that 
the estimate of the corresponding [3j will also be small and can have zero posterior 
mode. On the other hand, hyperprior distributions that favor large values of the tj's 
will lead to posterior modes for the /3j's that are close to the maximum likelihood 
estimates. Note that, since Eijj) — i for all j, 7 effectively controls the sparseness 
of the model. Figure [3] shows the different shapes that the hyper distribution for Tj 
can take as the parameter 7 varies. Analogous models have been applied by many 
authors e.g. Genkin et al [§], Figueiredo and Jain 0, Neal and MacKay [n|. 



We show below that by introducing some latent variables, the generalized linear 
model that takes to be equal to the Student-t cdf T v with v degrees of freedom, 
can be seen as a missing data model and hence amenable to EM algorithm. This 
procedure offers a tractable way to estimate the parameters of the model. The latent 
variable representation for this model was first introduced by Albert and Chib [H, 
who derive a Gibbs sampling algorithm to find an estimator of the parameter (3. 



In the same spirit, Scott [1J| introduced a latent variable representation for the 
logistic model. 



so 
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Fig 3 . Shapes that the exponential hyperprior for the Tj parameters takes when the hyper param- 
eter 7 changes. 



Let Z\,...,Z n be n independent random variables where Zi is normally dis- 
tributed with mean xf (3 and variance A,^ 1 , N(xf f3, X^ 1 )- Define the probability 
of Yi given Zi as a Bernoulli distribution with parameter equal to the probability 
that Zi is non-negative Pr(Zi > 0) i.e. Yi = 1 if Zi > and Yi = otherwise. Let 
the inverse of the variance of Z%, Xi be distributed as Gamma(v /2, 2/v) with pdf 
proportional to X^ 2 ~ 1 exp(~vX i /2). 

Note that the marginal distribution p(z\(3, x, v) is t v (x T f3), a Student-t density 
with v degrees of freedom and location x T f3, and p(y = l|/3, x) = p(z > 0|/3, x) = 
T v (x T (3) is the Student-t cumulative distribution with v degrees of freedom, cen- 
tered at and evaluated at x T f3. Therefore, we see that by integrating the latent 
variables Zi we recover the original model. 

2.2. General case 

Fernandez and Steel propose a class of skewed densities indexed by a scalar 
5 e (0, oo) of the form: 

2 y 

P(V\S) = j-rMf )/[o,oo)(!/) + fW{ -oo,0) 
o + s 

where /(.) is a univariate probability density function (pdf) with mode at 
and symmetry around the mode. The parameter 5 determines the amount of mass 
at each side of the mode, and hence the skewness of the densities. When 5 varies 
between (0, 1), the distribution has negative skewness and when S is bigger than 1, 
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the distribution has positive skewness. We replace the function /() in equation (1) 
by the Student-t distribution and consider the corresponding cdf which we denote 
by T v>s . 

Let Y±, . . . ,Y n be n binary independent random variables distributed Bernoulli 
with probability of success given by P(Yi = l|/3, Xj) = T v j(xf (3). As in the sym- 
metric case of Section 12.11 we introduce some latent variables that will allow us to 
find the parameter estimates using the EM algorithm. 

Consider n random variables Z\ , . . . , Z n such that the distribution of Zi is the 
skew normal with mode at xf f3 given by 



(3) p(Zi\Xi,P,\i,6) = — -y. 



Hn-xff3) 



5 




where n = > xf ff) + z l 81{z i < xff3). Define Yi = 1 if Zi > and Y t = 

otherwise. Note that we can derive the pdf of equation (2) by replacing the /() 
function of equation (1) with the normal distribution with mean xf f3 and variance 
Aj and accordingly divide the mass of the distribution at each side of the mode xj (3. 
Assume, as in the symmetric case, that a priori the distribution of j3j is normal 
with mean and variance Tj, the distribution of Aj is Gamma(v/2,2/v) and the 
distribution of the hyperparameters Tj are exponential exp(2/"f). See Figure U for 
a graphical model representation of this model. 

Note that by marginalizing the pdf in equation (2) with respect to Xi we get, 

P ( Zl \x u( 3,S) = 2 ^[fh l + ^" a:/ /3i " 
( 4 ) x [^[xfP^i) + ^h-oo.xfpMW 



V 



■J+l 



which is the skew Student-t distribution with mode at xj f3, v degrees of freedom 
and 8 the parameter that controls the skewness of the pdf. Figure [5] shows this pdf 
for different vales of 5 and 8 degrees of freedom. The continous line corresponds 
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Fig 5. This graph shows the different shapes that the skew Student-t distribution with 8 degrees 
of freedom. The continues line correspond to the symmetric case. 



to the symmetric Student-t distribution. We can derive the pdf from equation (3) 
by replacing the function /() from equation (1) with a Student-t distribution with 
mean xf (3 and v degrees of freedom and accordingly divide the mass of the pdf at 
each side of the mode. 



3. The algorithm 

3.1. The EM algorithm 

In this section we derive the equations for the EM algorithm in the symmetric and 
general cases explained above. 

The EM algorithm, originally introduced by Dempster, Laird and Rubin [5(, 
provides an iterative procedure to compute maximum likelihood estimators (A1LE) . 

Each iteration in the EM algorithm has two steps, the E-step and the M-step. 

E-step: Compute the expected value of the complete log-posterior, given the 
current estimate of the parameter and the observations, usually denoted as Q. 

M-step: Find the parameter value that maximizes the Q function. 



3.2. Symmetric case 

The complete log posterior for (3 is given by 



logp(/3|y, z, \,v) oc -(z - Hf3) T A(z - H0) - (3 T W(3 
(5) oc 2f3 T HAz - (3 T H T AH/3 - f3 T W(3 
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where A = diag(X\, . . . , X n ), H = [xi,...,x n ] T is the design matrix and W — 
diag{ri , . . . , r" 1 ) is the covariance matrix for the normal prior on (3. 

From equation (5) we see that the matrices A and W and the vector z corre- 
spond to the missing data. Therefore, the E-step needs to compute E(t^ x |y, ^ \ 
j,v), E(X i \y,0 ( - t '>,^,v) and E(zi\i\y, fi^,^,v) for i = 1, . . . , n. 

In order to do these computations we need first to get the pdfs p(zi\$^\ yi, Xi, 
7,i>) and p(Xi\0^ t \yi,v). Note that the conditional probability of Zi given 
yi, Xi, 7 and v is a normal distribution with mean xfd and variance Xi but left 
truncated at zero if yi = 1 and right truncated at zero if yi — 0. The posterior 
probabilities for A^ also have closed form given by, 



p{X l \(3 {t) ,y-,v) 



p(\ t \y)H-VT t xJj3) „ 

T v (-Xf/3) Ul ~ ' 

where $ denotes the cumulative distribution for the standard normal with mean 
and variance 1. 

Now we can compute the expected value for Aj and Zi, which are given by 



(6) 



and 



(7) 



E{Xi\p [t) ,y u v) 



T V+2 ( X J(3^) 



T v+2 (- x Jl3^) 
T v {-Xj(3) 



Hi = o, 



1 0(v^7/3 (t) /i(x"))) 
v / A7*(\A7/3 (t) '»(xW)) 



Vi = 1, 



respectively. The next computation in the E-step is E(XiZi\yi, 0^\v). Observe that 
E{XiZi\ yi ,$W,v) = E{E(XiZi\ yi ,$W,v,Xi)} = E{XiE(zi\yiJW,v,Xi)}. There- 
fore by replacing first cq. (7) and then eq. (6) in this expectation we obtain, 



(8) 



J B(A i z J |y l ,/3 ( \v) = 
( x T /3T v+2 (x T Py/^) t v (x T /3) 

T v (X T /3) T v (X T fl) 



Vi = i, 



x T (3t v+2 (-x t py^) _ t v (x T f3) = fl 

T v (-X T fl) T v (-X T j3) Ul 



Finally, the expectation for tj is given by, 



E(T- 1 \yJ( t \ 1 ,v) = 



1 
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Denote, 

W* = diag(E(T^\y, , 7, v),... , E^ 1 |y, $<■*) , 7, v)), 
A* = diag(E{X 1 |y, /§(*) , 7, u), . . . , £(A„ |y, j§W , 7, v)), 
z* = (£(ziAi|y, /3W , 7, «),-.. , S(^A„|y, £<*> , 7, ^)) T 

Then the M-step that results from maximizing equation (5) with respect to f3 is 
given by, 



(10) 



$= (H 1 A*H + W*y i H 1 z 



Summarizing, in the (t + l)th iteration of the EM algorithm, 

E-step: Compute W* , A* and z* using equations (8), (5) and (7) respectively. 

~(t+l) 

M-step: Obtain a new estimate (3 by replacing the new values of VT*,yl* 

and z* in eq. (9). If ||/3 (t+1) - /3 (t) ll/ll/3 (i) ll < A stop, otherwise go back to the 
E-step. We fix A = 0.005 following Figuereido and Jain 0]. 

3.3. General case 

In the general case of Section 4.2, the complete log-posterior can be written as 

logp(/3|A,v,y,z,<f) 
oc -(r - Hf3) T A{v - Hp) - (3 T W(3 



(11) 



oc 2(3 T HAr - f3 T H T AH/3 - /3 T W(3 



where = ^-I(zi > 0) + ZiSI(zi < 0), W and A are defined in Section 5.1. To 
get the new equations for the EM algorithm we have to compute E(Xiri\yi, ftw, v) 
and E(Xi\yi, v) for i = 1, .. . ,n. Using the same trick that we used before, i.e. 
E(XiZi\yi,$^ t \v,S) = EiEiXiZilyiJ^^v^XiJ)} = E{\iE{zi\y u ^,v, \,S)}, 

we obtain E(XiTi\yi, p ,v, Xi) for y t — and y^ = 1 in the following way: 
E(Xiri\yi = l,p,v,\i) = 



(12) 



T„. s (XfP) + ° T v , s (XfP) 

xjp {T v+3 MxTP(8-l)y/W)-T»+iA-xTPVW 

S T v6 (Xib T P) 
1 {t v , a (-XTP)-t„, 6 (xTP(8-l))} 
S T V , S (X?P) 
SXJpT v + 2 _ s (XJp^±?) , t v . s (-Xfp) 
T v {XjP) + T^s(XfP) 



xfp > 0, 
xjp < 0, 



E(Xtri\yi = 0,P,v,Xi) 



(13) 



' XTPT V+2 , S (-Xfpy/^) _ 1 U.s(XJp) 

s T v , s (-xrp) s TvtS (-xfP) 

dXil3 t„m-xTP) 

, t VtS (XJP(^-l))-t s(-XjP) _ 1 tv. s (xfP(8-l)) 

Tv.si-XfP) 6 Tv , s (-XfP) 

xTpT v+2 , s (xTP(6-l)y/z3Z) 
S T Vll (-XfP) 



xfp > 0, 



xfp < 0. 
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The expected value for Xi given yi,p \v, E(Xi\yi,p \v), can be computed as 
follows: 



(14) 



S T V , S (X?P) 
"5 T v>e {Xf$) " t " 

c T„ +a , f (-a;?'j3(^--i)v^I?) ^r, 

° T v>s {XfP) 

T v , s {xjP) x * P < u ' 



(15) 



£(A,| yi = 0,/3, u) 



« T ViS (-Xf/3) 

l r v+2 ,3(xT/3(5-i)y3E 

<5 T v 's(-XjP) 



xf P > 0, 



t v+2iS (-xT/3^IE) 



Tv.s(-XfP) 

T VtS (-XjP) 



xfp < 0. 



Denote, 



(16) 
(17) 
(18) 



r* = (E(X 1 r 1 ),...,E(X n r n )), 
W* = diag(E(r^\y, 0(*>, 7) «),..., E^" 1 |y, /3«,7, «)). 
A* = dia 5 (£(Ai|y, /3( f ) , 7, v), . . . , E(X n \y, , 7, v)). 



The new steps for the EM algorithm are as follows. In the it + l)th iteration, 
E-step: Compute W* , A* and r* using equations (8), (12) and (11) respectively. 

M-step: Obtain a new estimate f£ + ^ by replacing the new values of W*,A* 

and r* in eq. (9), where now z* is replaced by r*. If — 11/11/3^11 < A 

stop, otherwise go back to the E-step. We set A = 0.005. 

In both cases, the symmetric and general, we initialized the algorithm by setting 

pt "* = (el + H T H)^ 1 Hy (with e = le — 6), which corresponds to a weakly penalized 
ridge-type estimator. 



4. Simulation study 

In this section we present a simulation study that compares the flexible Student-t 
link (hereafter FST) with the probit and logistic models and also with the recently 
introduced "elasticnet" (Zou and Hastie fig). 

We assess the performance of the models by measuring the misclassification rate 
using simulated datasets. Two examples are presented here. 

Example 1: We generated 10 datasets consisting of 10 predictors and 100 obser- 
vations. The response random variable was generated as a random binomial of size 
1 and probability 0.6 and the 10 predictors were generated as a random binomial 
of size 1 and probabilities equal to 0.3 (in three predictors), 0.5 (in five predictors) 
and 0.8 (in two predictors). 

Example 2: We generated 10 datasets consisting of 10 predictors and 100 obser- 
vations. We allow in these datasets high correlation between the response variable 
and the predictors, and also high correlation within the predictors. We generated 
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the response and the predictors as multivariate normal with mean zero and some 
structure in the covariate matrix and then we dichotomized these variables by as- 
signing a to negative values and 1 to positive values. The assumed covariance 
matrix structure is the following: the response together with the first 4 predictors 
all have correlation equal to 0.8 with each other. The next three predictors have 
correlation 0.3 with each other, and the last three predictors have correlation 0.4 
with each other. The other correlations are all equal to 0.01. 

Our model has three tuning parameters, v (the degrees of freedom of the Student - 
t distribution) that controls the thickness of the tails of the distribution, 7 that 
controls the sparseness of the parameter estimates and S that controls the skewness 
of the distribution. For different values of the parameters (v £ {1, . . . , 8, 15, 30, 
50, 100}, 7 € {0.01, 0.1, 1, . . . , 10, 20, 50, 100} and S G {0.01, 0.1, 0.5, 0.7, 1, 1.2, 1.5, 
2,3,4,5,10}) we estimate the misclassification rate of our proposed model. We 
compare the best performance with the performance of the generalized linear models 
with probit and logit links which were fitted using the R statistical package (see 
results in Table [1]). In Example 1, our flexible t link function FST consistently gives 
better performance than the logit and probit, ranging from 1% improvement in 
dataset 9 to 16% in the first dataset. Note that most of the best models choose the 
degrees of freedom of the Student-t distribution to be equal to 1 i.e. they prefer 
fat-tailed distributions. 

We also compare our FST model with a related method introduced by (Zou and 
Hastie (2005)), the so-called elastic net. We estimated this model over a grid of 
points for the two parameters of the model (A £ {0.01,0.1,1,10,100}) using the 
statistical software R and selected the best performance. The results are shown in 
the last column of Table [T] Our method gives slightly poorer performance only in 
Datasets 9—10. 

We look at the best performance of the flexible probit (approximated by Student - 
t^o) and the logit (approximated by Student-t§) models (results in Table [2]) and 
their choice of the parameters that give best performance. Note that in most cases 
the same set of parameters gives best performance in both models. The three link 
functions consistently choose S — 1.2 in most datasets, i.e. they prefer skewed 
distributions. 

Table [3] shows the results for the simulated datasets in Example 2. FST outper- 
forms the three other methods in general by a substantial margin. We compare the 
sparseness of the elasticnet with the FST. In general, the choice of parameters for 
the best performed FST model, favors small values of the 7 parameter, which do 
not induced sparseness in the model. Compared to the elasticnet, the FST model 
appears to be less sparse. Results are shown in Table 0] for the 10 datasets of Ex- 
ample 2. 

5. Applications to text categorization 

Text categorization concerns the automated classification of documents into cat- 
egories from a predefined set C — {c±, . . . ,c m }. A standard approach is to build 
to independent binary classifiers, one for each category, that decide whether or 
not a document dj is in category Cj for i 6 {1, . . . , to} and j 6 {1, . . . , n}. Con- 
struction of the binary classifiers requires the availability of a corpus of documents 
D with assigned categories. We learn the classifiers using a subset of the corpus 
Tr C -D, the training set, and we test the classifiers using what remains of the 
corpus Te — D/Tr, the testing set. 



87 



Table 1 

Example 1. Misclassification rates for the generalized linear model with probit and logit link 
functions computed using the statistical software R are shown in the first two columns. The third 
column correspond to the best performance achieved, and the following three columns correspond 
to the parameters of the model that achieved the best performance. A * in a cell means that the 
minimum misclassification rate is achieved by more than one value of the parameter. 



Data Misclassification Parameters 





probit 


logit 


FST 


V 


8 


7 


criet 


1 


0.4 


0.4 


0.26 


1 


* 


* 


0.27 


2 


0.27 


0.28 


0.23 


1 


1.2 


0.01 


0.27 


3 


0.4 


0.4 


0.36 


* 


1.2 


0.1 


0.36 


4 


0.4 


0.4 


0.38 


1 


* 


0.1 


0.38 


5 


0.28 


0.28 


0.25 


1 


* 


* 


0.29 


6 


0.4 


0.4 


0.36 


1 


* 


0.01 


0.38 


7 


0.34 


0.33 


0.30 


1 


1.2 


* 


0.31 


8 


0.39 


0.37 


0.31 


* 


* 


* 


0.36 


9 


0.32 


0.32 


0.31 


* 


* 


* 


0.29 


10 


0.33 


0.33 


0.31 


* 


* 


* 


0.28 



Table 2 

Example 1. Best performance achieved by the probit and logit models when approximated by a 
t3o and tg respectively with the correspondent parameters. A * in a cell means that the 
minimum misclassification rate is achieved by more than one value of the parameter. 



Data 


Probit 


Parameters 


Logit 


Parameters 


probit 


8 


7 


logit 


s 


7 


1 


0.28 


1.2 


1 


0.27 


1.2 


1 


2 


0.26 


* 


0.1 


0.26 


* 


0.1 


3 


0.37 


1.2 


0.1 


0.36 


1.2 


0.1 


4 


0.4 


* 


1 


0.4 


* 


1 


5 


0.28 


* 


* 


0.27 


1.2 


1 


6 


0.39 


* 


0.1 


0.39 


* 


0.1 


7 


0.31 


1.2 


0.1 


0.32 


1.2 


0.1 


8 


0.31 


* 


3 


0.32 


* 


3 


9 


0.31 


* 


1 


0.31 


* 


1 


10 


0.31 


* 


1 


0.31 


* 


1 



Table 3 

Example 2. Misclassification rates for the generalized linear model with probit and logit link 
functions computed using the statistical software R are shown in the first two columns. The third 
column correspond to the best performance achieved, and the following three columns correspond 
to the parameters of the model that achieved the best performance. A * in a cell means that the 
minimum misclassification rate is achieved by more than one value of the parameter. 

Data Misclassification Parameters 



probit logit FST v 8 7 enct 



1 


0.08 


0.08 


0.03 


1 


* 


* 


0.06 


2 


0.15 


0.14 


0.06 


1 


* 


0.01 


0.13 


3 


0.13 


0.12 


0.06 


1 


* 


* 


0.12 


4 


0.13 


0.13 


0.1 


* 


* 


* 


0.11 


5 


0.08 


0.08 


0.05 


1 


* 


* 


0.08 


6 


0.11 


0.11 


0.08 


* 


* 


* 


0.10 


7 


0.09 


0.09 


0.07 


* 


* 


* 


0.09 


8 


0.13 


0.14 


0.09 


* 


* 


* 


0.13 


9 


0.15 


0.15 


0.07 


1 


* 


0.01 


0.14 


10 


0.18 


0.18 


0.12 


* 


* 


* 


0.16 



Table 4 

Number of parameters estimates equal to zero in the elasticnet and FST model for each of the 

datasets in Example 2. 



model 


1 


2 


3 


4 


5 


6 


7 


8 


9 


10 


enct 











2 


5 


3 


9 


6 


3 


7 


FST 


1 











2 


1 


4 








5 



88 



Usually, documents are represented as vectors of weights dj = (x\j, . . . , Xdj) 
where iy represents a function of the frequency of appearance of word u>i in doc- 
ument dj for d words w±, . . . ,Wd in the "ba g of words" . This is the so-called "bag 
of words" representation (see e.g. Mladcnic T2j). 

Due to the large number of possible features or different words that can be 
gathered from a set of documents (usually this could be one hundred thousand 
or more) the classifiers are commonly built with a subset of the words. The text 
classification literature has tended to focus on feature selection (or word selection) 
algorithms that compute a score independently for each candidate feature, this 
is the so-called filtering approach. The scores typically depend on the counts of 
occurrences of the word in documents within the class and outside the class in 
training documents. For a predefined number of words to be selected, say d, the d 
words with the highest score are chosen. Several score functions exist, for a thorough 
comparative study of many of them see Forman Q. We consider the 100 best words, 
according to the information gain criterion. Before we selected these 100 words we 
remove common noninformative words taken from a standard stopword list of 571 
words. 

We performed the experiments in one standard text dataset. The dataset that we 
use comes from the Reuters news story collection that contains 21,450 documents 
that have assigned zero or more categories to them among more than a hundred 
categories. We use a subset of the ModApte version of the Reuters— 21, 578 collec- 
tion, where each document has assigned at least one topic label (or category) and 
this topic label belongs to any of the 10 most populous categories — earn, acq, grain, 
wheat, crude, trade, interest, corn, ship, money-fx. It contains 6, 775 documents in 
the training set and 2, 258 in the testing set. 

To evaluate the performance of the different classifiers we use Recall, Precision 
and the F% measure. Recall measures the proportion of documents correctly clas- 
sified within documents in the same category. Precision measures the proportion 
of documents correctly classified within all documents classified into the same cat- 
egory, and Fi is the harmonic mean of Recall and Precision. There are two ways 
to average Recall, Precision and the F\ measure over all categories, micro-averaged 
and macro-averaged. The micro-averaged is an average weighted by the class dis- 
tribution and the macro-averaged is the arithmetic mean over all categories. All 
three measures depend on a specific threshold which is chosen by either doing cross 
validation or by letting part of the dataset (a validation set) determine the best 
choice, or by fixing it arbitrarily. We set this threshold to be 0.5, i.e., we simply 
classify a document to the category with the highest probability. 

To choose the model that will perform best in new data, we divide the dataset 
into three parts: a training set, a validation set and a testing set. In the training 
set, we fix the tuning parameters and learn the model which is tested using the 
validation set. We repeat this process for every set of tuning parameters. We pick 
the set of tuning parameters that gives the best performance in the validation set. 
Then the algorithm learns the model with the chosen tuning parameters, this time 
using training and validation sets. The performance of the this final model is asessed 
using the testing set. We repeat this whole process 5 times, for different splits of the 
dataset into training-validation-testing sets to evaluate the performance error. We 
utilize 50% of the documents for training, 25% for validation and 25% for testing 
in the 5 splits of the dataset. 

We vary the tuning parameters as follows: v £ {1, 2, 5, 8, 30}, 
7 G {0.01, 0.05, 0.1, 1, 2} and S S {0.1, 0.5, 1, 1.5, 2}. For each category, we pick the 
set of tuning parameters (w, 7, 5) that gives the highest performance according to 
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the F\ -measure in the validation test. Table \5\ shows the values of these parameters 
for Split 1. Note that three of the categories (among the most populous ones) choose 
a symmetric link (<5 = 1). 

The first column of Tables El [7] and [8] shows the micro and macro average of the 
F\ measure, recall and precision respectively, of the performance of the FST models 
in the testing set for the 5 splits of the dataset. 

The second column corresponds to the model with symmetric i-link with 30 

Table 5 

Tuning parameters for each category in the Reuters dataset chosen by Dataset 1. 





earn 


acq 


grain 


wheat 


crude 


V 


8 


2 


1 


1 


1 


7 


0.01 


0.1 


0.05 


0.1 


0.05 


8 


1 


1 


1.5 


0.5 


1.5 




trade 


interest 


corn 


ship 


money-fx 


u 


1 


8 


1 


2 


2 


7 


0.05 


0.01 


0.01 


0.05 


0.05 


8 


1.5 


1 


0.5 


1.5 


1.5 



Table 6 

Micro Fi and macro F\ measures for different link functions. The last two rows show the 
average and standard deviation over the five splits. 



Split 


FST 


Probit 


Loj 


;istic 




micro 


macro 


micro 


macro 


micro 


macro 


1 


0.862 


0.801 


0.855 


0.781 


0.857 


0.786 


2 


0.853 


0.796 


0.849 


0.788 


0.853 


0.795 


3 


0.863 


0.802 


0.859 


0.788 


0.861 


0.793 


4 


0.871 


0.807 


0.864 


0.788 


0.866 


0.795 


5 


0.874 


0.816 


0.867 


0.802 


0.869 


0.804 


mean 


0.865 


0.804 


0.859 


0.789 


0.861 


0.795 


sd 


0.008 


0.008 


0.007 


0.008 


0.006 


0.006 



Table 7 

Micro recall and macro recall measures for different link functions. The last two rows show the 
average and standard deviation over the five splits. 



Split 


FST 


Probit 


Logistic 




micro 


macro 


micro 


macro 


micro 


macro 


1 


0.818 


0.747 


0.790 


0.693 


0.796 


0.703 


2 


0.806 


0.749 


0.785 


0.715 


0.795 


0.730 


3 


0.821 


0.748 


0.796 


0.699 


0.803 


0.710 


4 


0.828 


0.749 


0.803 


0.703 


0.809 


0.715 


5 


0.831 


0.759 


0.805 


0.712 


0.812 


0.723 


mean 


0.821 


0.750 


0.796 


0.704 


0.803 


0.716 


sd 


0.01 


0.005 


0.008 


0.009 


0.008 


0.011 



Table 8 

Micro precision and macro precision measures for different link functions. The last two rows 
show the average and standard deviation over the five splits. 



Split 


FST 


Probit 


Logistic 




micro 


macro 


micro 


macro 


micro 


macro 


1 


0.910 


0.864 


0.932 


0.894 


0.930 


0.891 


2 


0.906 


0.849 


0.923 


0.878 


0.920 


0.872 


3 


0.909 


0.865 


0.932 


0.902 


0.928 


0.898 


4 


0.919 


0.874 


0.934 


0.897 


0.931 


0.894 


5 


0.921 


0.881 


0.940 


0.917 


0.935 


0.907 


mean 


0.913 


0.867 


0.932 


0.898 


0.929 


0.892 


sd 


0.007 


0.012 


0.006 


0.014 


0.006 


0.013 
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degrees of freedom, that approximates the probit link and the third column cor- 
responds to the model with symmetric i-link with 8 degrees of freedom. The last 
two rows of Table \6\ [7] and are the average and standard deviation, respectively, 
accross the five splits of the dataset. 

Note that the best performance is achieved by the FST model in the 5 datasets 
according to the F\ measure and recall. For precision, the best performing model 
utilizes a probit link. 



6. Conclusions 



This paper introduces a flexible Bayesian generalized linear model for dichotoumous 
response data. 

We gain considerable flexibility by embedding the logistic and probit links into 
a larger class, the class of all symmetric and asymmetric i-link functions. The 
empirical results and simulations demostrate the good performance of the proposed 
model. We find that the model with the i-link function consistently improves the 
performance, according to the F\ measure and misclassification rate, as compared 
with the models with probit or logistic link functions. We compare also our model 
with the elastic net which is a related method, and showed that our method in 
general outperforms the elasticnet usually by a substantial margin. 

The FST model being a Bayesian model that can also be interpreted as a penal- 
ized likelihood model, enjoys all the good properties of these models. Shrinking the 
parameter estimates for example, is an important property of these models, which 
has been shown widely that lead to good generalization performance. 

We implemented an EM algorithm to learn the parameters of the model. A 
drawback, is that our algorithm has been implemented to allow only categorical 
predictors. We plan to extent our work to allow for continuous predictors. 

Acknowledgments. We are grateful to David D. Lewis for helpful discussions. 
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